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Abstract 

In this paper, we apply a recently developed nonparametric modeling ap¬ 
proach, the “diffnsion forecast”, to predict the time-evolntion of Fonrier 
modes of tnrbnlent dynamical systems. While the diffnsion forecasting method 
assnmes the availability of a noise-free training data set observing the fnll 
state space of the dynamics, in real applications we often have only partial 
observations which are corrnpted by noise. To alleviate these practical is- 
snes, following the theory of embedology, the diffnsion model is bnilt using 
the delay-embedding coordinates of the data. We show that this delay em¬ 
bedding biases the geometry of the data in a way which extracts the most 
stable component of the dynamics and reduces the influence of independent 
additive observation noise. The resulting diffusion forecast model approxi¬ 
mates the semigroup solutions of the generator of the underlying dynamics 
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in the limit of large data and when the observation noise vanishes. As in any 
standard forecasting problem, the forecasting skill depends crucially on the 
accuracy of the initial conditions. We introduce a novel Bayesian method for 
hltering the discrete-time noisy observations which works with the diffusion 
forecast to determine the forecast initial densities. 

Numerically, we compare this nonparametric approach with standard 
stochastic parametric models on a wide-range of well-studied turbulent modes, 
including the Lorenz-96 model in weakly chaotic to fully turbulent regimes 
and the barotropic modes of a quasi-geostrophic model with baroclinic insta¬ 
bilities. We show that when the only available data is the low-dimensional 
set of noisy modes that are being modeled, the diffusion forecast is indeed 
competitive to the perfect model. 

Keywords: nonparametric forecasting, kernel methods, diffusion maps, 
diffusion models, time-lagged embedding, diffusion forecast. 


1. Introduction 

A long-standing issue in modeling turbulent dynamics is the so-called 
turbulent closure problem (see e.g. [1]) where the goal is to hnd a set of ef¬ 
fective equations to represent low-order statistics of the coarse-grained vari¬ 
ables of interest. The main difficulty of this problem is largely due to the 
inhnite dimensionality and nontrivial coupling of the governing equations of 
the statistics. In order to predict a few lower-order statistics of some resolved 
variables, common closure approaches were developed using physical insights 
to choose a parametric ansatz to represent the feedback from the unresolved 
scales (see e.g., [2] for various closure approximations for predicting passive 
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scalar turbulence and [3, 4, 5, 6] for various stochastic modeling approaches 
for predicting geophysical turbulence). 

Despite these successes, the parametric modeling approaches have prac¬ 
tical issues due to model error when the necessary physical insights are not 
known. If the parametric model (or ansatz) is not chosen appropriately, one 
can end up with a model with poor predictive skills (or even with solutions 
which diverge catastrophically) even when the parameters can be obtained 
by a standard regression htting procedure [7]. Moreover, even when an ap¬ 
propriate parametric form is chosen, specifying the parameters from noisy 
observations of the physical variables can be nontrivial since the parameters 
are typically not directly observed. Indeed, it was shown that an appropriate 
parameterization scheme is crucial for accurate hltering and equilibrium sta¬ 
tistical prediction even when the parametric forms are appropriately chosen 
[ 8 ]. 

Recently, a nonparametric modeling approach, called the diffusion fore¬ 
cast, for predicting the evolution of the probability density of low-dimensional 
dynamical system was introduced in [9]. The approach of [9] can be intu¬ 
itively viewed as extending the standard nonparametric statistical models 
(such as kernel density estimates) which are used to estimate time-independent 
densities [10]. The key idea behind the diffusion forecast is to use a basis 
of smooth functions to represent probability densities, so that the forecast 
model becomes a linear map in this basis. Numerically, this linear map is 
estimated by exploiting a rigorous connection between the discrete time shift 
map and semi-group solution associated to the backward Kolmogorov equa¬ 
tion. In [9], it was shown that the resulting model estimates the semigroup 
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solutions of the generator of the underlying dynamics in the limit of large 
data. Moreover, the smooth basis is dehned on the training data set, using 
the diffusion maps algorithm [11, 12], which means that the data require¬ 
ments only depend on the intrinsic dimensionality of the dynamics. 

In this paper, we test this nonparametric modeling approach as a method 
of forecasting noisy observations of Fourier modes from a selection of well- 
studied high-dimensional dynamical systems in various turbulent regimes. A 
novel aspect of this paper is that we consider building a forecasting model 
given a noisy training data set consisting of partial observations of the dy¬ 
namics, as is common in practical applications, in contrast to the work in [9] 
which used noiseless full observations to train the diffusion forecasting model. 
A key ingredient for solving initial value problems in any forecasting problem 
is accurate initial conditions. While initial conditions were assumed to be 
given in [9], in this paper, we introduce a novel Bayesian hltering method to 
iteratively assimilate each observation and find the initial probability densi¬ 
ties given all of the past noisy observations up to the corresponding initial 
time. 

We should note that the diffusion forecasting method [9] could be naively 
applied to signals corrupted by observation noise, however the resulting non¬ 
parametric model would implicitly include the observation noise in the model, 
which would limit the forecast skill compared to treating the noise as a sep¬ 
arate process. Treating the noise as a separate process requires hrst learn¬ 
ing the ‘correct’ model from the noisy training data set, and then generat¬ 
ing ‘clean’ initial conditions for forecasting from the noisy observations. In 
[13, 14, 15] it was shown that applying diffusion maps to the delay-embedded 
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data reduces the influence of the noise on the diffusion maps basis. Building 
upon the work in [9], we apply the theory of [13] to show that building the 
nonparametric model using the delay-embedded data biases the geometry 
of the data in a way which extracts the most predictable component of the 
dynamics. We extend the theory of [13] by giving a rigorous justification for 
the reduction of the influence of independent additive observation noise on 
the resulting diffusion forecast model. 

One interesting question which we address here is whether it is possible to 
build a skillful nonparametric forecasting model for a turbulent mode given 
only a small amount of noisy training data, when the true dynamics are so¬ 
lutions of a high-dimensional dynamical system with chaotic behavior. This 
question arises because the nonparametric model has a practical limitation 
in terms of modeling dynamics with high-dimensional attractors, namely: it 
will require an immense amount of data to unwind the attractors since the 
required data increases exponentially as a function of the dimension of the 
attractor. Moreover, even given a sufficiently large data set, the required 
computational power would be a limiting factor since the diffusion maps 
algorithm requires storing and computing eigenvectors of a sparse N x N 
matrix, where N is the number of data points. Constrained by a small data 
set, the curse-of-dimensionality implies that we cannot unwind the full high¬ 
dimensional attractor. We attempt to circumvent the curse-of-dimensionality 
by decomposing the data into Fourier modes in the hope that delay recon¬ 
struction of each mode projects onto a different component of the dynamics. 
We do not claim that the Fourier decomposition can completely resolve this 
issue but we will numerically demonstrate that the Fourier decomposition 
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will map an isotropic turbulent field in the spatial domain (which implies 
that each spatial component is as predictable as any other spatial compo¬ 
nent) to a coordinate system in which some modes are more predictable 
than others. Of course, the standard theory of embedology [16] suggests 
that the delay-embedding of a single Fourier mode would reconstruct the 
entire high-dimensional attractor, which would again be inaccessible to our 
nonparametric model due to the curse-of-dimensionality. This would suggest 
that nothing could be gained by building separate models based on delay¬ 
embedding of each mode. However, the full attractors reconstructed from 
each mode are only equivalent in a topological sense, and the geometries of 
these reconstructed attractors are dramatically different. The biased geome¬ 
try influences the nonparametric model of [9] through the use of the diffusion 
map algorithm which is known to preserve the geometry which the data in¬ 
herits from the embedding space [11, 13]. The diffusion maps algorithm 
preserves the biased geometry of the delay embedding as was shown in [13]; 
and we will see that this biased geometry projects the full dynamical system 
onto the most stable components of the dynamics in the direction of the cho¬ 
sen observations. When we apply the nonparametric model of [9] using the 
basis arising from this biased geometry, we find improved forecasting skill 
and robustness to observation noise. 

The remainder of this paper is organized as follows. In Section 2, we in¬ 
troduce the problems under consideration and establish the necessary back¬ 
ground, including a brief overview of the nonparametric modeling approach 
introduced in [9] as well as a discussion on how the theory of [13] is applied to 
mitigate the effect of noise on the model. We conclude Section 2 by introduc- 
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ing the iterative Bayesian filter which we use to generate initial conditions 
for forecasting with the nonparametric model. In Section 3, we numerically 
compare predicting Fourier modes of the Lorenz-96 model in various chaotic 
regimes using the nonparametric model with the persistence model, perfect 
model, and various parametric models, including the autoregressive models 
of order-1 (MSM [17]). In Section 4, we numerically compare the nonpara¬ 
metric model with a stochastic model with additive and multiplicative noises 
(SPEKF model [18, 19]) in predicting the barotropic modes of a geostrophic 
turbulence. We close this paper with a short summary in Section 5. 

2. Nonparametric diffusion modeling 

Let u{x,t) G be solutions of an ergodic system of nonlinear PDFs, 



( 1 ) 


where A denotes nonlinear differential operators, for smooth initial conditions 
u{x, 0) and periodic boundary conditions on a non-dimensionalized periodic 
domain x G [0,27r]"'. To simplify the exposition, set s,n = 1 without loss of 
generality. Here, the solutions of (1) can be described by the inhnite Fourier 
series. 



where the Fourier modes Uk can be utilized in analyzing (1). 


Dehne Ut = {u{xo,ti), ■ ■ ■ ,u{x 2 m,ti)) ^ whose components are the 

solutions of (1) at time realized at grid point Xi = ih, i = 0,..., 2m, such 
that (2m + l)h = 2 ti. Our goal is to construct a forecasting model for the 
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discrete Fourier coefficients, 



2m 


( 2 ) 


given the corresponding observed modes, Vi^k = (hi, where the £-th com¬ 
ponent of Fj, 


v{xi, ti) = u{xi, ti) + Ei^i, Ei^i ~ 7\^(0, R), 


( 3 ) 


is the solution of (1), corrupted by i.i.d. Gaussian noise, at the grid point xi 
and time step U. Notice that {e^}e=o,±i,...,±m forms an orthonormal basis of 


£ 2 m+i respect to the inner product dehned in (2). One should also note 
that Vi^k = Ui,k + Vi,k, where rji^k ~ A/'(0, R) and R = R/{2m + 1) [20, 17]. 


Given the noisy time series of modes F := {vi^k}keK,{i=i,...,N}, our goal 
is to use this noisy data set to train the nonparametric model and to gen¬ 
erate initial conditions for forecasting modes Ui := {ui^k}keK: at time index 
i > N. Particularly, we will consider /C to be a set containing a single mode 
or containing three modes in our numerical experiments. In Section 2.1, we 
provide an overview of the construction of the nonparametric model intro¬ 
duced in [9] for fully observed data (all modes) without observation noise. In 
Section 2.2, we show that by applying a delay embedding on noisy data, Vi^k, 
we can improve the forecasting skill of the underlying mode, Ui^k- Finally, in 
Section 2.3, we introduce a simple Bayesian hltering method for generating 
initial conditions for forecasting Ui from the noisy observations hj. 

2.1. Nonparametric forecasting model 


In this section, we review the nonparametric diffusion forecasting model 
introduced in [9], assuming the data set consists of full observations of the 


dynamical system with no observation noise. Consider a system of SDEs, 


du = aiii) dt + b{u) dWt, 'u(O) = ito, 


( 4 ) 


where a{u) is a vector field, b{u) is a diffnsion tensor, and Wt is an i.i.d. Wiener 
process; all defined on a manifold C M". Assnme that the dynamical sys¬ 
tem governed by (4) is ergodic on Ai with an invariant measnre with density 
fnnction Peq{u). We also assnme that the evolntion of a density fnnction is 
given by p{u, t) = e^^*p{u, 0) which converges to the invariant measnre Peq{u), 
where C* denotes the Fokker-Planck (or forward) operator andp('U, 0) denotes 
an initial density. The approach we will describe below is nonparametric in 
the sense that it does not assnme any parametric strnctnre on the dynamics, 
a, b, the distribntion, p, or even the manifold, A4, which will all be implicitly 
represented in the model. However, this does not mean that the method does 
not have parameters, and these parameters are ronghly analogons to the bin 
size in a histogram. 

Given a time series {ui = onr goal is to approximate p{u,t) 

withont knowing or even explicitly estimating a, b. Instead, we will directly 
estimate the semigronp solntion associated to the generator C of (4) by 
projecting the density onto an appropriate basis for L^(A4, Peq)- In particnlar, 
we choose eigenfnnctions {pj} of the generator £ of a stochastically forced 
gradient system. 


du = —'VU{u) dt -|- \/2dWt, 


( 5 ) 


where the potential fnnction U = — log(peg) is dehned by the invariant mea¬ 
snre of the fnll system (4) so that the invariant measnre of (5) is peq = e~^ = 
Peq- This choice of basis is motivated by several considerations. First, we can 
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estimate the eigenfunctions {^j} by the diffusion maps algorithm for data 
lying on a compact manifold [11] and on a non-compact manifold [12]. In 
this paper, we will use the variable bandwidth kernel introduced in [12] to 
construct these eigenfunctions since the sampling measure of the data set 
may be arbitrarily small, which would imply that the data does not lie on a 
compact manifold. Second, by projecting the density function on this basis 
of eigenfunctions, we can accurately forecast each projected component by 
a discrete representation of the semigroup solution where r = tj+i — f*. 
We now show how to construct the discrete representation of e'^'^ using the 
“shift operator”. 

Let {(pj} be the eigenfunctions of the generator C of the gradient system 
in (5); these eigenfunctions are orthonormal under (•, ■)p^^ in L‘^{Ai,Peq) and 
the integral here (and in all of the inner product dehned below) is with respect 
to the volume form dV which Ai inherits from the ambient space. Note that 
C is the generator of gradient system in (5) and it is the adjoint of the Fokker- 
Planck operator C* with respect to inner-product (•, •) in L‘^{Ai). One can 
show that {'ijjj = PjPeq} are eigenfunctions of C* which are ortho normal with 
respect to inner-product {■,-)peq-^ L‘^{-M.,Peq~^). Given an initial density 

p{u, 0) = po{u), we can write the forecast density p{u, t) as, 

p{u,t) = e*^*po{u) = = '^{PoA^^Pj)Pj{u)Peq{u). (6) 

i j 

Setting t = 0 in (6) we hnd, 

Po{u) = p{u, 0) = Pj)Pj{u)Peq{u) = Cj{0)ipj{u)peq{u), (7) 

j 3 
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where we define Cj(0) = {pQ,(pj). Substituting (7) into (6), we obtain, 

p{u,t) = EE ci{0){pi,e*^Pj)p^^iPj{u)Peg{u). (8) 

j I 

The key idea of the non-parametric forecasting algorithm in [9] is to approx¬ 
imate Aji = {(pi,e^^(pj)p^^ by replacing the semi-group solution e’"'^ with the 
discrete time shift operator S, which is defined as Sf{ui) = /(fij+i) for any 
/ G L‘^{A4,peq). In [9], it was shown that S' is a stochastic estimate of 
and the error due to the stochastic terms can be minimized by projecting S 
on the basis ipj. Indeed it was shown that Aji = {pi, Sipj)p^^ is an unbiased 
estimate of Aji, meaning that IE[Aj 7 ] = Aji. Furthermore, assuming that Ui 
are independent samples of peq, one can show that the error of this estimate 
is of order which means that one can apply this approximation for 

any sampling time r given a sufficiently large data set N. 

Notice that for any / G C‘^{Ai,Peq), we have: 

j 

= ( 9 ) 

j 

From the ergodicity property of (4), one can deduce that the largest eigen¬ 
value of e’"'^ is equal to 1 with constant eigenfunction, l(-u). Setting / = 1 
in (9), we have, 

j 

which implies that '^j)peq = (1) '^i)p^q or in compact form, Aei = Ci, 

where [e^j = (l,9?j)peq) so Ci is 1 on the first component and zero otherwise. 
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Assuming that the data is sampled from p^g, we just deduced that the largest 
eigenvalue of A should be equal to 1 (with no complex component). Numeri¬ 
cally however, we sometimes hnd that A has some eigenvalues slightly larger 
than one, due to the hnite number of samples in the Monte-Carlo integrals. 
If this occurs, we can easily ensure the stability of the diffusion forecast by 
dividing any eigenvalue with norm greater than 1 so that is has norm equal 
to one. We note that in our numerical experiments below this issue rarely 
occurs. The ease with which we can guarantee the stability of this nonpara- 
metric model is an advantage over many parametric methods. 

We should also note that if (4) is a gradient system as in (5), then Aji = 

where \j are the eigenvalues of C = C, which can 
be obtained directly from the diffusion maps algorithm [11, 12]. Moreover, 
the matrix A becomes diagonal (due to the orthonormality of ipj) so that 
the diffusion maps algorithm estimates A directly and the shift operator 
approximation is not required. See [21] for various uncertainty quantihcation 
applications for this special case. 

To conclude, the non-parametric forecasting algorithm (which we refer 
to as the diffusion forecast) for a given initial density po{u) is performed as 
follows: 

1. Learning phase. Given a data set apply the diffusion maps 

algorithm [11, 12] to obtain eigenvectors {(pj}, whose Tth component, 
= (pj{ui), approximates the eigenfunction ipj of the gradient flow 
(5) evaluated at the data point iti. We implement the diffusion maps 
algorithm with a variable bandwidth kernel, see the Appendix for the 
step-by-step algorithm to obtain these eigenfunctions. Also, see the 
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supplementary material of [9] for a short overview of [12], 

2. Initial conditions: Represent the initial density po in this basis as, 

Cj{0) = {P0,I>j) = {Po/Peq,I>j)p,, = ^ (10) 

j=i Peq\'>^i) 

which is numerically obtainable from the last inner product as a Monte- 
Carlo integral since we have pj evaluated on the data set Ui which are 
samples of p^q- 

3. Forecasting: Apply a Monte-Carlo integral to approximate the shift 
operator S in coordinate basis pj, 

. N 




( 11 ) 


2=1 


We then approximate Aji with Aji such that the diffusion forecast is 
given by 


p{u, mr) ^ EEW )jlCi{0)ipj{u)peq{u) =^Cj{mT)ipj{u)peq{u),{l2) 
j I j 

where in practice, these sums are truncated at a hnite number, M, of 
eigenfunctions. With this truncation, the coefficient Cj(mr) is the j- 
th component of a matrix-vector multiplication, c{mT) = y4'"c(0), of an 
MxM matrix A™ and an M-dimensional vector c(0) = (ci(0),..., cm(O))''". 


2.2. Time-lagged embedding 

The diffusion forecast algorithm of Section 2.1 assumes that the data Ui 
are sampled directly from the full dynamical system (4). That is, the full 
system u in (4) is equivalent to the dynamical system for u and the tur¬ 
bulent dynamics considered here will be very high dimensional. For such 
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high dimensional problems, exploring the entire attractor would require a 
prohibitively large data set; larger than will typically be available in applica¬ 
tions. Instead, we attempt to build a low-dimensional model for each mode 
Ui^k individually. An individual mode Ui^k is simply a linear projection of the 
full system Uj, and moreover it is an ‘observation’ of the spatial state u. While 
we could apply the diffusion forecast to the one dimensional time series iti^k, 
this would ignore the dependence of the fc-th mode on all the other modes. 
The fundamental problem with building a one-dimensional stochastic model 
for Ui^k is that any interaction with the other modes will result in (among 
other changes to the model) an inflated stochastic component in the closed 
model for Ujfc (see [22, 8] for a rigorous example). Inflating the stochastic 
component of the model for Ui^k will severely limit the predictive skill of 
the nonparametric model. On the other hand, if we include other modes 
in the nonparametric model, this would increase the dimensionality and not 
all of the information in the other modes would be relevant to forecasting 
the k-th. mode. Instead, we will apply the delay coordinate reconstruction 
of [23, 16, 24, 25] to implicitly recover only the missing components of the 
dynamics which are important for forecasting each Uj fc. Moreover, we will 
apply the theory of [13] to show that the delay coordinate reconstruction 
projects the missing components of the dynamics onto the most stable com¬ 
ponent of Ui^k- Finally, in practice we will only have access to a noisy data set 
Vi^ki and the theory of [13] suggests that the delay coordinate reconstruction 
also reduces the influence of the observations noise, which we will now make 
rigorous. 

Given a time series iii^k = UkiU) of the k-th mode, we construct the delay 
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embedding coordinates, 


■ ■ ■ 1 ^i—L,k) 

The theory of embedology shows that if the k-th mode is a generic observa¬ 
tion of the full system (4), then for sufficiently many lags, L, there exists a 
diffeomorphism such that Ui^k = This statement was hrst shown for 

deterministic dynamics on smooth attractors in [23] and subsequently gener¬ 
alized to fractal attractors in [16] and then to non-autonomous systems in [24] 
and stochastic systems in [25]. The existence of the diffeomorphism says 
that topologically the attractor of the delay reconstruction Ui^k is equivalent 
to the full dynamics iii and so we have reconstructed the hidden variables 
in the evolution of the fc-th mode. However, the theory of embedology only 
concerns the topology of the attractor, whereas the basis of eigenfunctions 
{(Pj} that are estimated from the diffusion maps algorithm will depend on 
the geometry of the delay reconstruction Ui^k- 

In [13] it was shown that the delay reconstruction severely biases the 
geometry in the reconstructed coordinates Ui^k, so that in the limit of inhnite 
delays, L —)■ cx), the dynamics are projected on the most stable component 
of the dynamics. Consider the fc-th mode Ui^k as an observation of the state 
iti, where the observation function is given by, 

Ui,k = hkiiii) = ( 0 ,..., 0 , 1 , 0 ,..., 0)ui. 

Notice that the derivative of the observation function is Dhk = (0,..., 0,1, 0,..., 0) 
where the 1 occurs in the fc-th entry. Moreover, the previous value of the fc-th 
mode, iti-i^k, can also be considered an observation of iti where the observa¬ 
tion function is given by iii-i^k = hk{F_T-{ui)) and the map F_^{ui) = -Uj_i is 
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given by the reverse time evolution for the discrete time step r. Interpreting 
each Ui-i^k as an observation of the full state u at time ti shows that the 
time-delay embedding iii^k is an observation of Ui with observation function 
H as follows, 

^i,k H{Ui k^ hk(^F—ri^Ui)')y ■ ■ ■ y , 

where F^ir is the reversed shift map which takes Ui to Ui-i. For L sufficiently 
large and assuming that the fc-th mode is a generic observable on the man¬ 
ifold, the observation FI will have a full rank derivative [16]. Our goal now 
is to examine the change in metric induced by this derivative by seeing how 
it acts on tangent vectors on the attractor. Let £ TuM. be tangent 

vectors on the attractor and let = DH{u)vi and z /2 = DH{u)h >2 where 
C>i,C >2 £ Th{u)H{A4) are the transformed tangent vectors in the new geom¬ 
etry given by the time-delay embedding. Then the inner product (•, •)«/. in 
the delay embedding space is 

L 

(Fl,d2)iRi = '^{Dhk{F_ir{u))DF_ir{u)iyi,Dhk{F_ir{u))DF_ir{u)iy2)R 
1=0 
L 

= '^{DF_ir{u)iyi)k{DF_ir{u)u2)k, (13) 

1=0 

where we used the fact that Dhk = (0,..., 0,1, 0,..., 0) and the subscript-/c 
denotes the kth component of the corresponding vector. If ui and i >2 are 
in the m-th Oseledets space, with Lyapunov exponent am, then the inner 
product reduces to, (Fi,z/ 2 )rl = This shows that the 

most stable Oseledets space will dominate the geometry in the embedding 
since cXm < 0 will be most negative (largest in absolute value of the negative 
Lyapunov exponents) in the most stable direction. 
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The bias introduced into the geometry by the delay embedding has sev¬ 
eral advantages for forecasting. First, the most stable components of the 
dynamics are the easiest to predict since trajectories converge very quickly 
when measured in these components. Given an initial condition which is a 
small perturbation of the truth in a stable direction, the perturbation will 
decrease in the forecast, reducing the effect of the initial perturbation. Since 
numerically we will not be able to represent the whole attractor, the variable 
bandwidth kernel will implicitly project away the small components of the 
geometry. By amplifying the stable components we insure that the most 
desirable aspects of the dynamics will be well represented in the discrete rep¬ 
resentation of the geometry. Secondly, when applied to the noisy data, h, the 
noise will correspond to an unstable component, and so the delay embedding 
geometry will de-emphasize the noise component of the data. Formally, con¬ 
sider a noisy observation Vi^k = Ui,k + 6 where are independent identically 
distributed random variables independent of iii^k with ]E[,^j] = 0. When we 
compute the inner product of noisy vectors in the delay embedding space we 
hnd the relative difference, 

L 

1=0 

Since the noise is independent and independent of it.^k all the terms inside 
the sum above have expected value of zero, so by the law of large numbers, 
in the limit as L —)■ cxd we hnd that for i ^ j, 

H(Vi^k^ H(Vj^k) H(^Ujk) 

_ L '^1=0 + ^j-A-l,k + 
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assuming that ^H{ui^kYH{uj^k) = i J2i=o'^i-i,kUj-i,k is nonzero as L —)■ cxo. 
Since we preserve all the pairwise inner product in the embedding space, this 
implies that we preserve the metric up to a constant. This shows how the 
additive observation noise has a smaller impact on the delay embedding than 
the original embedding, especially for large L. 

Finally, note that the delay reconstruction of each mode, it.^k, will cap¬ 
ture a different aspect of the most stable components of the dynamics since 
(^i)fc(i^ 2 )fc corresponds to the projection of the vectors z/i, z /2 into the k-th co¬ 
ordinate direction. Thus, the delay embedded modes Vi^k = {vi,k, ■■■,Vi-L,kY 
represent orthogonal components of the most stable Oseledets directions of 
the dynamics, which in turn are orthogonal to the additive observation noise 
in the limit of large L. 

We now demonstrate the effect of the biased geometry on a simple ex¬ 
ample. We will generate a stochastic dynamics on the unit circle by hrst 
numerically integrating the one-dimensional SDE, 

de = {2 + sm{e))dt + VYldWt, (15) 

and then mapping the intrinsic variable 6 onto the unit circle embedded in 

by the map 9 1 —)■ {x{9),y{9)Y = (cos(6'), sin(6*))^. In this example we use 
discrete time step At = 0.1 to produce 10000 samples of the system (15) 
which are mapped into the plane as {xi,yiY = (cos(6*j),sin(6*j))'''. We then 
generate noisy observations {xi,yiY = {xi,yiY + Vi where the observation 
noise Vi ci-re independently sampled from a two dimensional Gaussian distri¬ 
bution with mean zero and covariance matrix ^/ 2 x 2 - We chose this example 
to be very predictable given the cleaning training data (xj,j/j)^, and the 
small stochastic component is only included so that even the perfect model 
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can only predict for a finite time. Notice that the observation noise in this 
example is very large, as shown in Fignre 1, for example the variance of the 
noise in the x-coordinate is approximately 32% of the variance of the trne 
signal. 

Onr goal in this example is to show how the delay-embedding can re- 
dnce the inflnence of observation noise on a diffnsion model by biasing the 
geometry. We nse the hrst 5000 noisy data points {xi^yiY to train mnlti- 
ple diffnsion models, each time applying a different nnmber of delays before 
bnilding the diffnsion model. We also train a diffnsion model nsing the clean 
data set For each diffnsion model, we apply the Bayesian hlter de¬ 

veloped in Section 2.3 below to generate initial conditions for forecasting from 
the noisy observations in the verihcation period i = 5001,..., 10000. 

Each initial condition in the verihcation period is forecast for a total of 500 
forecast steps (50 time nnits) and the RMSE between the forecast and the 
trnth is averaged over the verihcation period. The RMSE as a fnnction of 
the nnmber of forecast steps for each model is shown in Fignre 1, along with 
the forecast of the x coordinate at the 50 step lead time, compared to the 
trne valne of x. 

While no amonnt of delays is able to match the dihnsion model bnilt ns¬ 
ing the clean data set, the improvement in forecast skill is signihcant as the 
nnmber of delays increases. We shonld note that since this model is intrinsi¬ 
cally one dimensional, the most stable component of the dynamics represents 
the entire dynamical system. This implies that projecting on the most sta¬ 
ble component can only improve the forecasting skill for this example. For 
high-dimensional examples, as the nnmber of delays becomes very high, the 
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projection onto the most stable component will remove information which is 
important to the forecast, leading to reduced forecast skill as shown in Figure 
2 which we will discuss in Section 3 below. Therefore, in general we expect 
to see a tradeoff where small numbers of delays can remove the influence 
of noise, but extremely large numbers of delays may actually degrade the 
forecast by projecting away valuable components of the dynamics. 

2.3. Bayesian filter 

In Step 2 of the forecasting algorithm described in Section 2.1, we assume 
that the initial distributions for forecasting are given. In practice, however, 
we only have the noisy data Vi := {vi,k}keK., and the goal of this section 
is to develop a numerical method to determine the initial distribution of 
the corresponding modes, Ui := {ui^k}keK- We will introduce a Bayesian 
hlter to iteratively combine the information from the noisy data with the 
diffusion forecast. We should note that the noisy observations which we use 
for forecasting are always taken out-of-sample, meaning they are separate 
from the training data. 

If the observations were continuous (i.e. for r —)■ 0), we could hlter the 
noisy observations Vi using the Zakai equation projected onto the basis {(fj}, 
and this was the approach taken in [21]. However, since the observation time 
may be long, we will take a different approach here and apply a Bayesian 
update at discrete hnite time step. Explicitly, we want to hnd the posterior 
density p{ui \ Vs<i) by updating the previous posterior density p{ui-i \ Vs<i-i) 
with the current noisy observation Uj. 

Our Bayesian hlter will follow the standard predictor-corrector approach. 
Given an initial density at the previous time, p(ui-i \ Vs<i-i), the hrst step is 
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to compute the prior forecast density with our nonparametric model, 

M M 

p{Ui\Vs<i-l) = '^Cj{ti)(pj{u)peq{u) = '^{Ac{ti_i)) jif j{u)peq{u), 
i=i i=i 

where the coefficients = (p('Uj_i \vs<i-i),^j) represent the posterior 

density at time projected in diffusion coordinates, pj. Since the noisy 
observation Vi has the form Vi = Ui + rji where rji ~ A/'(0, R), the likelihood 
function for the noisy observation is, 

p{vi I Ui) oc exp(-||hj - Mi|||/2). 

We can now assimilate the observation Vi by using Bayes law to combine the 
prior forecast and the likelihood function above as follows. 


viui I Vs<i) oc p{Ui \Vs<i-l)v{Vi I Ui). 


(16) 


By computing the product (16) we can hnd the desired posterior at time U, 
up to the normalization factor. We estimate the normalization factor, Z, as 
a Monte-Carlo integral. 


N 




1=1 


p{ui I 1 )p{Vi I Ui) 

Peq{Ul) 


p{u\Vs<i-l)p{Vi I u) dV{u), 


'M 


and dividing the product in (16) by Z we recover the posterior density at 
time ti- 

In order to initialize this iterative procedure, we spin up the hltering pro¬ 
cess for few assimilation cycles starting from the invariant measure p{uq) = 
Peq{u). To obtain initial density p{ui\vs<i), which is used for forecasting 
starting at time ti, we run this Bayesian hlter, starting from to allow 
for P steps of spin-up with the invariant distribution as the initial density. 
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p{ui-p I Vs<i-p) = Veqiu). In our numerical simulations below, we used a very- 
short spin-up time of P = 10 steps. 

The method developed above requires only that the observation noise 
is independent and that the likelihood function function p{vi \ Ui) is known. 
The only assumption on the posterior density is that it is well approximated 
in the truncated basis {pj}fLi, which intuitively implies that the posterior 
density is smooth and is not very highly oscillatory. 

3. Forecasting weakly chaotic to fully turbulent modes of Lorenz- 

96 model 

As our test example, we consider Fourier modes of the Lorenz-96 (L96) 
model [26] in various chaotic regimes. The Lorenz-96 model is defined by the 
following forced-dissipative nonlinear system of ODEs, 

^ = {Ui+I - Ui_2)ui-1 - Ui + F, £ = 1,...,40, (17) 

The right hand side of (17) consists of an energy conserving quadratic advective- 
like term, a linear dissipation, and a forcing parameter F. Following [26], 
we resolve the L96 model at 40 equally spaced grid points with a periodic 
boundary (represented by the subscript I being taken modulo 40) in order to 
mimic weather wave patterns on a midlatitude belt. The statistical behavior 
of the Fourier modes of this model has been analyzed in [27]; as the forcing 
parameter F increases, the model becomes fully turbulent with Gaussian-like 
modal distributions. In our numerical experiments we will consider three dif¬ 
ferent values of F following [27]. As reported in [28], for P = 6, the system is 
weakly chaotic with largest Lyapunov exponent Ai = 1.02 and the dimension 
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of the expanding snbspace of the attractor is N~^ = 12. For F = 8, which is 
the original choice in [26], the system is strongly chaotic with Ai = 1.74 and 
= 13. For F = 16, the system is “fnlly tnrbnlent” with Ai = 3.94 and 
= 16. We shonld also note that when F = 6, the L96 model has a longer 
“memory” in the sense that it has a relatively slow decaying time (which is 
dehned as the integral of antocorrelation fnnction, see e.g., [27]), this is man¬ 
ifested visibly as a regnlar dominant westward propagating “weather-like” 
wave pattern of wavennmber-8. The memory becomes shorter as F increases 
and the “weather-like” wave pattern becomes less obvions. 

3.1. Diagnostic models 

In onr nnmerical experiments we will examine the forecasting skill of the 
nonparametric modeling approach on a selection of Fonrier modes, inclnding 
those with high and low energies as well as large and small correlation times. 
To diagnose the forecasting skill, we compare the diffnsion forecast on these 
modes with forecasts from standard statistical approaches when the nnderly- 
ing dynamics are nnknown as well as with the perfect model. In particnlar, 
we will consider: 

3.1.1. Persistence model 

The simplest nonparametric model for predicting the fntnre when the 
nnderlying dynamics are not known is by setting the cnrrent observation, hj, 
as the forecast at each lead time. 

3.1.2. Linear autoregressive models 

A popnlar statistical modeling approach when the nnderlying dynamics 
are not known is to £t the data to a class of linear antoregressive models. 
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In this paper, we consider two models from such a parametric modeling 
approach that were designed previously as cheap hlter models: the Mean 
Stochastic Model (MSM) [29, 17] and the stable constrained autoregressive 
model of order-3 (AR3) [30, 31]. 

The key idea of MSM is to model each Fourier mode as a complex valued 
linear stochastic model with additive white noise, 

du = Xudt + a dWt, (18) 

where Wt is a complex valued Wiener noise with variance t. The parame¬ 
ters A and a are determined by matching the analytical expression for the 
variance and correlation time at the statistical equilibrium state of the MSM 
model in (18) with the corresponding empirically estimated statistics from 
the data (see Chapter 12 of [17] for a detailed formulation). In our implemen¬ 
tation below, we will obtain these parameters from the statistics of the noisy 
dataset, v, during the training period. To generate initial conditions for the 
forecast, we apply a one-dimensional Kalman hlter in order to account for the 
observation noise in v, and we assume that the noise observation variance, 
R, is known. 

The stable and consistent AR3 model was introduced in [30] as a cheap 
linear model for hltering turbulent modes with long memory, such as the L96 
model with F = 6. The standard AR3 model is given by, 

^m+l 2 T ®2^m—1 T (1 T 0‘3^'^m T Vmi Vm ~ A/*(0, ■ (l*^) 

The stability of this model is sensitive to the choice of sampling time At = 
tm+i — tm, especially when a standard linear regression is used to determine 
the model parameters. In [30], it was shown that one can obtain accurate 
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filtered mean estimates in a linear filtering problem with truth generated 
from (18) if the parameters aj are chosen to satisfy a set of linear algebraic 
consistency conditions which depends on A of (18) and the sampling time 
At (see [30] for details). For nonlinear problems, a stable model together 
with these consistency constraints can be determined by an algebraic method 
proposed in [31] and this AR3 model is what we use in the examples below. In 
particular, we will enforce the consistency condition for the AR3 model using 
the parameter A obtained from the MSM £t of the noisy data at the training 
period. To generate initial conditions for the forecast, we apply a three- 
dimensional Kalman hlter to account for the noise observed Vk, assuming 
that the observation noise variance, R, is known (see e.g. [32, 30, 31] for the 
details of the AR hlter). 

3.1.3. Perfect model 

Finally, we also include the forecast of the perfect model in (17). To 
generate the initial conditions, we implement a standard ensemble Kalman 
hlter method [33] with 80 ensemble members, double the dimension of the L96 
model. The perfect model experiment with full data refers to forecasting skill 
where the initial conditions are determined by hltering noisy observations at 
all 40 grid points. 

We also show forecasting skill with a perfect model experiment given only 
observations of noisy modes A = {vi,k}k&K:- In this scenario, the ensemble 
Kalman hlter estimates the initial conditions at unobserved modes which will 
be needed for integrating the full model; obviously, if every mode is observed, 
we are back to the perfect model experiment with full data. Comparing this 
scenario with the dihusion forecast constructed from the same A provides the 
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most objective comparison between the full model and the nonparametric 
diffusion model. In the example below, we will consider examples where K, 
consists of a single mode and the three most energetic modes, K. = {7, 8, 9}. 

3.2. Experimental design 

In our numerical simulations, the true time series are generated by inte¬ 
grating the L96 model with the fourth-order Runge-Kutta method with time 
step 5t = 1/64. Noisy observations are generated by adding independent 
random samples of a Gaussian with mean zero and variance i? = 1 to the 
true solutions at every grid point at each observation time step At = 1/8 
as in (3), resulting in a noise variance of i? = 1/40 on each Fourier mode. 
Given a time series of length 15000, we use the hrst 5000 data points to train 
the three models (diffusion, MSM, and AR3) and we compute the prediction 
skill on the remaining V = 10000 data points. To diagnose the prediction 
skill, we use the standard Root-Mean-Squared Error (RMSE) and anomaly 
pattern correlation skill dehned as follows. 


RMSE{t) = 

\ i=l 

(20) 

PC{t) = 

1 ^ iui{r) - uy{u{{T) - u^) 
y 1 ~ h*| \\u{{t) — u^W ’ 

(21) 


where uj(T) denotes the truth and u{(t) denotes the forecast at lead time 
T. As dehned in [34], the anomalies are dehned with respect to the clima¬ 
tology which is empirically estimated by time average of the truth, v} = 

evaluation below, we will evaluate the forecast skill 
on Fourier modes k ^ IC, where k ^ 0] the climatological means of these 
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nonzero wave numbers are zero. We will also evaluate these measures on the 
40-dimensional spatial domain solutions and in this case, the climatological 
mean v} is nonzero in our example. 

The nonparametric modeling method has two nuisance parameters, the 
number of lags L in the time-delay embedding and the number of diffusion 
modes, M. As discussed in Section 2.2 a large number of lags can reduce 
the influence of independent identically distributed observation noise and 
also biases the geometry towards the stable component of the dynamics. 
In the example in Section 2.2 we saw that a large number of delays could 
signihcantly reduce the influence of observation noise on a one-dimensional 
dynamical system. In Figure 2 we show that for higher-dimensional dynam¬ 
ical systems, there is a limit to the number of lags which can be used to 
reduce the influence of noise. This limit is explained by the projection of the 
geometry onto the most stable component of the dynamics, which for large 
lags will practically eliminate less stable components which also contain in¬ 
formation necessary for forecasting. The number of lags that can be usefully 
included is limited by the size of the most stable Lyapunov exponent, since 
this controls the exponential bias of the delay-embedding towards the most 
stable component as shown in Section 2.2. Notice that in Figure 2, mode-18 
appears to be more sensitive to the number of lags because the noise is much 
larger as a percentage of the variance. In each case the standard deviation 
of the observation noise is 40“^^^ ~ 0.16, and the standard deviation of the 
mode is shown by the RMSE of the corresponding invariant measure. In 
each case the optimal number of lags is between L = 2 and L = 10, which is 
consistent with each mode being a generic observation of the same Lorenz-96 
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dynamical system, and hence each mode has the same most stable Lyapnnov 
exponent. In the remaining nnmerical experiments in this section, we show 
resnlts with L = 4 lags in the time-delay embedding. 

The other nnisance parameter is the choice of the nnmber of diffnsion 
modes, M. In Fignre 3 we show the forecasting skill for varions choices of 
M for an individnal mode-8 for the case of weak tnrbnlence F = 8. Notice 
that while the initial errors increase as fewer modes are nsed, the forecast 
skill at later times are not very different. For deterministic problems, onr 
experience snggests that larger M help improve the prediction skill. For 
stochastic problems, however, the error estimate between A and A derived 
in [9] snggests that a large data set is reqnired to obtain accnrate estimation 
of the eigenfnnctions associated with these high modes. At the moment, a 
systematic method for choosing M is still nnder investigation and for the 
remainder of this paper, we simply set M = 2000 for every mode. Fignres 
2 and 3 indicate that the diffnsion forecast is not very sensitive to these 
nnisance parameters for tnrbnlent modes, and these parameters can be tnned 
nsing cross-validation on the training data set. Also, there is clearly some 
improvement with appropriate choice of lag L relative to no delay coordinate 
embedding. Even with snch empirical choices of parameters {L and M), we 
will see that the forecasting skill is still competitive to those of the perfect 
model when only the modeled modes are observed. 

3.3. Single-mode observations 

In Fignre 4, we show the RMSE and pattern correlation for forecasting 
three different Fonrier modes of the L96 model in a weakly chaotic regime 
with F = 6: mode-8 carries the largest energy with a relatively short correla- 


tion time; mode-13 carries low energy with a relatively long correlation time; 
and mode-18 carries low energy with a relatively short correlation time. Note 
that the ratios between the observation noise variance R= 1/40 and the en¬ 
ergy of the modes are very different for these three modes; 4% for mode-8, 
11.7% for mode-13, and 30% for mode-18. Therefore, we expect that the 
noise does not affect the forecasting skill of modes-8 and 13 as much as that 
of mode-18. 

As expected, the perfect model with full observation provides the best 
forecast on all modes. The diffusion forecast performs very well on all modes, 
outperforming the perfect model when given equal information, meaning 
only observations of the corresponding individual mode (which is referred 
as “perfect sparse obs” in Figure 4) on modes 8 and 18. For these modes, 
both autoregressive models (MSM, AR3) have very little predictive skill. On 
mode-13, the forecasting skill of all four models (AR-3, MSM, perfect sparse 
obs, and diffusion models) are not very different. The persistence model 
always produces severely biased forecasts in the long run, since hxing a par¬ 
ticular observation will always be worse than predicting the statistical mean 
in a long term forecast. 

In Figures 5, we compare the forecast time series of each model to the 
truth for forecast lead time of 2 model time units. This comparison is made 
for mode-8 with F = 6 for verihcation time steps between 1000 and 1100. 
Notice that the diffusion model forecast underestimates many of the peaks 
slightly whereas the perfect model initialized with the corresponding mode 
(perfect partial obs) overestimates many of the peaks. The persistence model 
forecasts are clearly out-of-phase. On the other hand, both the MSM and 
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AR3 models significantly underestimate the amplitude of the signals and in 
the long term they simply forecast zero, as expected since these forecasts 
represent the mean of linear unbiased autoregressive models. 

We also report the forecast skills of the more chaotic regime with F = 8 
in Figure 6 and the fully turbulent regime with F = 16 in Figure 7. Based 
on the RMSE and pattern correlation measures, these results suggest that 
the diffusion model still produces the most skillful forecasts compared to the 
MSM, AR3, the perfect model partial obs and persistence models on mode- 
8. On higher wave numbers, the forecasting skill diminishes as F increases 
and the advantage of the nonparametric model over the parametric models 
become negligible. 

The numerical experiments in this section suggest that the diffusion model 
is competitive with the perfect model when the available data is the corre¬ 
sponding noisy observations of the single mode that is being modeled. We 
also hnd that the diffusion model is most skillful when modeling the highly 
energetic modes. We should note that as F increases, although mode-8 is 
still the most energetic mode, the energy is less dominant relative to the 
other modes in these turbulent regimes [27]. We suspect that this is one of 
the causes of degradation in the forecasting skill when F is larger, in addi¬ 
tion to the more quickly decaying correlation functions in the fully turbulent 
regimes. 

3.4- Multi-modes observations 

Now, we consider forecasting the three most energetic modes, /C = {7, 8, 9}. 
There are two strategies that can be adopted in the diffusion modeling: 1) 
Learning the 3-modes diffusion model directly; 2) Concatenate the three in- 
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dependent one-mode diffusion models (we refer to this strategy as Concate¬ 
nated DM or CDM). Obviously, strategy 1) requires more data for accurate 
estimation since the dimensionality of the underlying dynamics that are be¬ 
ing modeled increases. In our numerical experiment, however, we do not 
increase the amount of data, instead we use the same 5000 data points that 
were used in building the individual one-mode diffusion models. 

In Figure 8, we see that while the perfect model given full data is still the 
best model, when only these three modes data are available, the forecasting 
skill of the perfect model is comparable to the concatenated three-modes 
diffusion model. The concatenated diffusion models (CDM) is slightly bet¬ 
ter than the three-modes DM since the underlying dynamics that are being 
modeled increases in CDM. In this case, these nonparametric approaches 
signihcantly supersede the persistence model. 

Finally, we also report the overall performance of the diffusion forecast 
given data of all Fourier modes in Figure 9. Since constructing the full 
40-dimensional diffusion model is beyond the computational capacity, we 
just show results where we concatenate all 21 independently constructed 
one-mode diffusion models. In this verihcation, the quantifying RMSE (20) 
and pattern correlation (21) are measured in the spatial domain. While the 
diffusion forecast is not expected to be comparable to the perfect model given 
full observation data, one can see that it still produces some skillful forecast 
beyond the persistence model and the skill diminishes as F increases. 

From these numerical experiments, one can see that if the only available 
data is an individual mode or few modes that are being predicted, the fore¬ 
casting skill of the diffusion model is indeed competitive with that of the 
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perfect model. The degrading performance of the perfect model is due to 
inaccurate estimation of initial conditions at unobserved modes. When the 
full data is available, this issue disappears in the perfect model experiment. 
Given all the observations, the nonparametric diffusion model does not have 
capability to represent the full nontrivial interactions of the dynamics that 
can only be achieved given the full model with accurate initial conditions. 
This degradation is totally expected since the nonparametric model is con¬ 
structed with a short time series, and as we mentioned in the introduction, 
even if a large data set is available, the nonparametric model is subject to 
the curse-of-dimensionality. 

4. Forecasting geophysical turbulent modes 

In this section, we consider forecasting turbulent modes of the quasi- 
geostrophic model, a prototypical model for midlatitude atmosphere and 
oceanic dynamics [35]. We consider a two-layer quasigeostrophic (QG) model 
which is externally forced by a mean shear with streamfunctions 

= -Uy, T 2 = Uy, (22) 

such that it exhibits baroclinic instabilities; the properties of the turbulent 
cascade has been extensively discussed in this setting (see e.g., [35, 36] and 
citations in [37]). 

The governing equations for the two-layer QG model with a flat bottom, 
rigid lid, and equal depth H are given by, 

= 0, (23) 

q2) + ^(/^ - kjU) + = 0, (24) 
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where V’ denotes the perturbed streamfunction, subscript 1 corresponds to 
the upper layer and subscript 2 corresponds to the bottom layer. In (23)- 
(24), [3 is the meridional gradient of the Coriolis parameter; k is the Ekman 
bottom drag coefficient; J('0, q) = ^’xQy — 'if’yQx is a Jacobian function which 
acts as nonlinear advection; U is the zonal mean shear, selected so that the 
QG equations exhibit baroclinic instability with a turbulent cascade; u is the 
hyperviscosity coefficient, chosen so that z/V®g filters out the energy buildup 
on smaller scales when finite discretization is enforced. The perturbed QG 
potential vorticity q is defined for each layer by 

g. = vV* + y ('03-i - i’i)- (25) 

The parameter kd = y/S/Ld gives the wavenumber associated with radius 
of deformation, or Rossby radius, Ld (the scale at which Earth’s rotation 
becomes significant to the dynamics of the system). 

In our numerical simulations, the true signal is generated by resolving 
(23)-(24) with 128x64x2 Fourier modes, which corresponds to the 128x 128x 
2 grid points. This model has two important nondimensional parameters: b = 
I3{L/2 'kY/U o, where Uo = 1 is the horizontal non-dimensionalized velocity 
scale and L is the horizontal domain size in both directions (we choose L = 
2n), and F = (L/27r)^/L^, which is inversely proportional to the deformation 
radius [38]. As in [39, 38], we will consider two cases: one with a deformation 
radius Ld such that F = 4 which roughly mimics a turbulent jet in the 
midlatitude atmosphere and the other case will have a deformation radius 
Ld such that F = 40 which roughly mimics a turbulent jet in the ocean. For 
the ocean case, the QG model in (23)-(24) is numerically very stiff since the 
term that involves k‘^ = S/L‘^ = 8F is large. 
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The large-scale components of this turbulent system are barotropic and 
for the two-layer model with equal depth, the barotropic streamfunction is 
dehned as an average between the two layers, -0^ = |('0i + '02) [35, 39]. In 
the atmospheric regime with F = 4, the barotropic flow is dominated by 
a strong zonal jet while in the ocean case with F = 40 there are transitions 
between zonal flow and large-scale Rossby waves (e.g., see [40, 38, 41]). 

In our numerical examples, we build a diffusion model for the two-dimensional 
Fourier modes 0^^^ of the barotropic streamfunction, 'ipi,. Following the ex¬ 
periments in [38, 41], our choice to only model this mode is mainly due to the 
fact that small-scale observations of a turbulent jet are typically not available, 
especially in the ocean. For diagnostic purposes, we compare the diffusion 
model to a simple stochastic model with combined white and colored additive 
and multiplicative noises that was designed as a test hlter model for stochas¬ 
tic parameterization in the presence of model error [18, 19]. The governing 
equation of this model is given as follows, 

dtp = {—jtp + iutp + b) dt + a dWt, 
db = —{Xbb — b)dt + (Tf, dWb^ti (26) 

dy = — (A.y7 — 7) dt-|- 

where variable tp models the Fourier mode of tpk^i, dropping the horizontal 
and vertical wave components {k, i) to simplify the notation. The equation 
governing tp represents the interactions between this resolved mode and the 
remaining unresolved modes by several noise terms. The hrst noise term, 7, 
is a real valued multiplicative noise. The second noise term is b, which is 
an additive complex valued noise governed by the Ornstein-Uhlenbeck mean 
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reverting SDE. The final noise term is adWt, which is an additive white noise. 
In (26), the stochastic terms, Wt, Wb^t, are complex valued and are real 
valued Wiener processes. We should note that while the simple system in (26) 
is nonlinear (due to the multiplicative term —'j'lp), it has analytical statistical 
solutions [18, 19] which we will utilize in our numerical experiments here. 

In [18, 19, 17], they implemented a Kalman update on the analytical mean 
and covariance solutions of (26) and called this filtering method SPEKF 
[18, 19, 17], which stands for Stochastic Parameterized Extended Kalman 
Filter. In our implementation below, we will use SPEKF to generate ini¬ 
tial conditions for our forecast at the verification times. As in any standard 
parametric modeling approach, one of the main issue is in choosing param¬ 
eters {cj, (T, Ab, 6, (Tfe, A-y, 7 , cr^} in (26) to produce high forecasting skill. We 
use parameters that are tuned to optimize the filtered solutions as in [38]. 
We found that this set of parameters produce better forecast compared to 
that obtained by an adaptive estimation method such as [42] (results are not 
shown). Even though the parameters are empirically chosen, we shall see 
that the forecasting skill of the SPEKF model can be competitive with the 
diffusion forecast in some regimes. 

In our numerical experiments, we generate a time series of 9000 data 
points at discrete time step At = 1/4 and added Gaussian noise in the 
physical space at 36 regularly spaced grid points in the spatial domain as in 
[38] with noise variance R = 25%Var{'ijj^). In Fourier space, the observation 
error variance is R = i?/36; in Figure 10, we show R relative to the variance 
of the 12 modes corresponding to the 36 regularly spaced observed grid points 
for both the atmospheric and ocean regimes. We use the first 5000 noisy data 
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points to train the diffusion model and the remaining 4000 data points to 
compute the forecasting skill measures. For the diffusion forecast model, we 
empirically choose L = 9 for the atmospheric regime and L = 4 for the ocean 
regime. In all experiments, we use M = 2000 diffusion modes. Note that one 
can compare the results with the perfect model forecasts, filtering the same 
36 observations to generate initial conditions. In previous work [38] it was 
shown that while the initial conditions for the atmospheric regime can be 
accurately specified with an appropriate EnKF scheme, in the ocean regime 
the initial conditions are not easily estimated due to numerical stiffness. Since 
hnding accurate initial conditions for the perfect model is not the main point 
this paper, we neglect showing the perfect model experiment in this section. 
Instead, we just compare it with the truth which readily represents the best 
forecasting scenario. 

In Figures 11, we show the RMSE and pattern correlation for the first 
two energetic modes in the atmospheric regime. Here, the first two energetic 
modes correspond to the two-dimensional horizontal wave numbers (0,1) and 
(1,1) with explained variances of about 65% and 73%, respectively, so the 
behavior is dominated by the zonal jet in the horizontal direction [38]. In 
this regime, notice that the persistence model produces high forecasting skill 
with a very long correlation for the first mode. However, the persistence 
model is not useful in forecasting the second mode. In contrast, the para¬ 
metric SPEKF model produces biased long-time forecast on the first mode 
but more accurate forecast on the second mode. In this regime, the diffu¬ 
sion forecasting skill is quite robust for both modes with best RMSE and 
PC scores. In Figure 12, we show the corresponding forecast time series of 
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the real component of these two most energetic modes at lead times 0, 4, 
and 8 model time units. Notice that the Bayesian hlter works reasonably 
well, indicated by the accurate recovery at lead time-0. The forecast skills 
of SPEKF and diffusion forecast for the hrst mode are relatively similar and 
very skillful at lead times 4 and 8 with PC high scores, above 0.9. For the 
second mode, on the other hand, one can see the forecasts of both models 
become less accurate at these lead times with very low PC, less than 0.5. 
Here, notice that SPFKF forecasts are nearly zero. 

In Figures 13, we show the RMSE and pattern correlation for the hrst 
two energetic modes in the ocean regime. Here, the hrst two energetic modes 
correspond to the two-dimensional horizontal wave numbers (1, 0) and (0,1) 
with explained variance of about 50% and 97%, respectively, so the behavior 
is dominated by two competing modes, the Rossby mode (1, 0) and the zonal 
jet mode (0,1) [38]. In this regime, the persistence model forecast has no 
skill at all on the hrst mode. On the second mode, however, the persistence 
forecast is quite reliable in the short term. The dihusion forecast is better 
than SPEKF on the hrst mode, but slightly worst than SPEKF on the second 
mode (indicated by worse PC score beyond lead time 2). In Figures 14, we 
show the corresponding forecast time series of the real component of these 
two most energetic modes for lead times 0, 1, and 2. At these lead times, 
where the PC score is at least 0.6, we expect the forecasting skill to be 
comparable on the hrst mode while SPEKF should be slightly better than 
the dihusion forecast on the second mode. 

In Figure 15, we report the RMSE and pattern correlation computed over 
the 6x6 observed grid points; obtained by inverse Fourier transforming 12 
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independent models for the corresponding modes. In terms of filtering, notice 
that the estimates from both the SPEKF and diffusion models are much more 
accurate compare to the estimates from the persistence model whose error 
is nothing but the observation noise standard deviation. In both regimes, 
the overall performances in terms of RMSE and PC scores from the best 
to the worst are the Diffusion model, SPEKF, and the persistence model. 
Notice that the forecasting skill of the diffusion model and SPEKF are com¬ 
petitive in the very short term forecast. In the long term, however, while 
the RMSE of the diffusion forecast always converges to the climatological 
standard deviation, SPEKF RMSE exceeds this climatological standard de¬ 
viation, indicating a biased forecast. From the simulations in this section on 
a nontrivial prototypical example of midlatitude geophysical turbulent flows, 
we see that the proposed non-parametric diffusion models produce robust, 
skillful, and unbiased long term forecasts. 

5. Summary 

In this paper, we applied the nonparametric diffusion modeling approach 
proposed in [9] to forecast Fourier modes of turbulent dynamical systems 
given a hnite set of noisy observations of these modes. Motivated by the 
results in [13, 14, 15], we built the nonparametric model in the delay embed¬ 
ding coordinates in order to account for the interaction between the resolved 
and unresolved modes as well as to smooth out the noisy observations. We 
empirically verify that larger L helps to reduce the influence of the obser¬ 
vation noise, however our theoretical derivation suggests that taking L too 
large will project away less stable components of the dynamics. The choice of 
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L should not be so large as to project away the unresolved modes, but large 
enough to smooth out the noise in the training data set, and currently this 
selection requires empirical tuning. To initiate each forecast, we introduced 
a simple Bayesian hltering method which accounts for the information of 
the noisy observations up to the current time. We compared the forecasting 
skills of the proposed nonparametric model with various standard modeling 
approaches when the underlying dynamics are unknown, including the per¬ 
sistence model, linear autoregressive models of order-1 (MSM) and order-3, 
the SPEKF model which involves combined additive and multiplicative noise 
terms. We also compared the diffusion forecast to the perfect model forecast 
in the L96 example above. In this example, we found that when the only 
available observations are the low-dimensional set of the modes that are be¬ 
ing modeled, then the diffusion forecasting method is competitive with the 
perfect model. When full data (all modes) becomes available, the curse-of- 
dimensionality limits the ability of diffusion model to represent nontrivial 
interactions of the underlying dynamics that can only be achieved given the 
perfect model and accurate initial conditions. Overcoming this limitation of 
the diffusion forecast is a signihcant challenge which we hope to address in 
future work. From the QG experiments, we found that while the short-term 
forecast of the parametric model, SPFKF, is competitive with that of the 
diffusion model, the latter produces more robust and unbiased long term 
forecasts. 

One important thing to take away from this paper is that although the 
proposed nonparametric model seems to be competitive with the standard 
parametric models simulated here and even with the perfect model with par- 


39 



tially observed data, we do not claim that this equation-free approach is 
the model to use for everything. If one knows the physics of the underlying 
dynamics (or the appropriate reduced parametric models), then one should 
use the physics-based model rather than the equation-free diffusion models. 
Indeed, it was shown rigorously in a simple setup that optimal hltering and 
accurate statistical prediction can be achieved with an appropriate stochastic 
parametric modeling of the unresolved scales [8]. However, there are at least 
two valuable aspects of the nonparametric model we develop here. First, 
one can use this model to diagnose whether his/her modeling approach is 
appropriate; we expect that appropriate physics-based models should beat 
this black box approach. Second, when an appropriate, physically motivated 
model is not available, then this approach will often outperform adhoc para¬ 
metric models as demonstrated in this paper. Moreover, in practice it is 
difficult to guess the appropriate choice of parametric models even if some 
physics-based knowledge has been imposed in deriving a reduced model (e.g., 
in turbulent closure problems, one typically uses parameters to represent 
some small-scale processes or some physical processes which are not well un¬ 
derstood). In this situation, we propose to extract the time series of these 
processes (if possible) and to subsequently build nonparametric models for 
such processes. This idea is what we refer to as the semi-parametric modeling 
approach and is introduced in the companion paper [43]. 
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Appendix A. Variable Bandwidth Diffnsion Map Algorithm 

For data {xi}f^i lying on a smooth manifold M. (with L = 0 lags), or 
for data given by a generic observation of a dynamical system on a smooth 
manifold (with L sufficiently large), the algorithm given below will provably 
estimate the eigenfunctions (pj of the generator C of (5). The estimate L of 
the operator 

C = -VH-V +A, 

where U = — log(peg), is up to a pointwise error of order O ^e, i^li/ 2 +d/A ^ 

where q{x) denotes the sample distribution, which by ergodicity is exactly 
the invariant measure peq of the underlying dynamics. We note that C 2 = 
d(l/4 — d/2) < 0 for the choice jS = —1/2 and a = —<7/4 so that the £- 
nal error term is bounded even when q{xi) is arbitrarily close to zero. For a 
derivation of these facts see [12] and for a brief overview see the supplemental 
material of [9]. Below, we only give a step-by-step cookbook for constructing 
the eigenfunctions {p>j} of t. 

1. Choose delay weight k > 0 and number of lags L. In our numerical 
experiments, we use k = 0. 
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2. Let Xi = x{ti) G M” be a time series with i = 1,iV + L data points. 
In our case, the data set are noisy Fourier modes Vi^k- 

3. For i = 1 + L,iV + L form the state vector, 


yi = [xi,e ^Xi-i,...,e Xi-Lf e 




4. For each i hnd the /c-nearest neighbors of Hi in let their indices 

be for j = 1, ...,k ordered by increasing distance. Here we used 

k = 512. 

5. Form a sparse {N — L) x {N — L) matrix with {N — L)k nonzero entries 
given by 

6 . Dehne the ad hoc bandwidth function pi = \jY^j=i d{i-, 

7. Automatically tune the bandwidth for the kernel density estimate. 

(a) Let ei = 2' for I = -30, -29.9,..., 9.9,10. 


(b) Compute Ti = E*p=iexp 

(c) Estimate the local power law T/ = e“ at each / by a/ = ^ ■ 

(d) Choose e = argmax^^la^}. 

(e) Estimate the intrinsic dimension d = 2max,;j{a;}. 

8 . Form the density estimate g* = q{xi) = ^xp 


-djij) 


logTi-logTi_i 


9. Dehne the bandwidth function p* = gf. We use (3 = —1/2 and a = 
-d/A. 


10. Repeat step 7 to choose the bandwidth e with Ti = Ei^=i ^xp \ 
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11. Form the sparse kernel matrix K{i, j)) = exp j • 

12. Form the symmetric matrix = (K + K'^)/2. 

13. Form the diagonal normalization matrix Da = ■ 

14. Normalize to form matrix D~^. 

15. Form the diagonal normalization matrix {Da)ii = 

16. Form the diagonal matrices Du = 2eqf^ and P = . 

17. Form the symmetric matrix L = P~^K^P~^ — D~^. 

18. Find the smallest magnitude eigenvalues \j and associated eigenvectors 
(pj of L. 

19. Normalize the eigenvectors so that ^ = 1 - 
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Figure 1: For training data with a significant amount of additive observation 
noise (top), the effect of increasing the number of lags on the diffusion model 
learned from noisy observations is shown in terms of the forecast RMSE 
(bottom, left) and the 50 step lead time forecast (bottom, right). 
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Figure 2: Forecasting skill (RMSE) of the diffusion model in predicting mode- 
8 (left) and mode-18 (right) of L-96 model with F = 6 constructed using 
various choices of L. 
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Figure 3: Forecasting skill of the diffusion model in predicting mode-8 of 
L-96 model with F = 6 constructed using various choices of M. 
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Figure 4: Forecasting skills of L96 models with F = 6 for modes 8 (first 
row), 13 (second row), and 18 (third row) in terms of RMSE as dehned in 
(20) (first column) and pattern correction in (21) (second column). 
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Figure 5: Mean forecast estimates for mode-8 of F = 6 at lead times 2 model 
time unit at the verification period of [1000,1100]. The truth (black dashes) 
and estimates (gray solid). 
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Figure 6: Forecasting skills of L96 models with F = 8 for modes 8 (first 
row), 13 (second row), and 18 (third row) in terms of RMSE as dehned in 
(20) (first column) and pattern correKltion in (21) (second column). 
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Figure 7: Forecasting skills of L96 models with F = 16 for modes 8 (first 
row), 13 (second row), and 18 (third row) in terms of RMSE as dehned in 
(20) (first column) and pattern correction in (21) (second column). 
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Figure 8: Forecasting skills of L96 models for the three most energetic mode 
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Figure 9: Forecasting skills of L96 models given full data. 
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Figure 10: Observation error variance relative to the variances of the 12 
Fourier modes corresponding to the 36 regularly spaced observed grid points. 
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Figure 11: Forecasting skills of the QG models for the two most energetic 
modes for the atmospheric regime with F = 4 in terms of RMSE as dehned 
in (20) (hrst column) and pattern correlation in (21) (second column). 
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Figure 12: Forecasts of the QG models for the real component of the two 
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most energetic modes for the atmospheric regime with F = 4 at lead times 
0,4,8 on verihcation interval [3000,4000] for mode 1 and [3000,3100] for 
mode 2. 
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Figure 13: Forecasting skills of the QG models for the two most energetic 
modes for the ocean regime with F = 40 in terms of RMSE as dehned in 
(20) (hrst column) and pattern correlation in (21) (second column). 
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Figure 14: Forecasts of the QG morals for the real component of the two 
most energetic modes for the ocean regime with F = 40 at lead times 0,1, 2 
on the verihcation interval [900,1000]. 
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Figure 15: RMSE and PC scores for the atmospheric (top) and ocean (bot¬ 
tom) regimes, computed over the 6x6 observed grid points and 4000 verifi¬ 
cation times. 
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